Recover squash-merged PR commits #30
Merged
…/rationale

A squash merge collapses a feature branch's commits into one commit on the default branch. WhyGraph already ingests the lost narrative (commit_titles, comments on the PullRequest row) but dropped both at the final serialization step. Stage 0 stops dropping them, with no schema change.

- mcp/evidence.py: _pr_dict now emits commit_titles + comments (uncapped, because the evidence tool's consumer is an agent that handles the full lists).
- analyze/rationale_generator.py: _format_pr renders a "Squashed commits" roster and a "Discussion" block, clipped by _PrRenderCaps. Caps are threaded purely from RationaleConfig via the generator (no get_config reach-in in the module-level formatters).
- core/config.py + whygraph.example.toml: three [rationale] rendering caps (pr_roster_max_commits=30, pr_discussion_max_comments=20, pr_comment_max_chars=500) with the same >=1 validation as max_diff_chars.
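The capped rendering described above can be sketched roughly as follows. This is an illustration only, not the code in analyze/rationale_generator.py: `PrRenderCaps`, `format_pr`, and the "(+N more)" overflow marker are hypothetical stand-ins, and only the three default values and the >=1 rule come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrRenderCaps:
    """Hypothetical stand-in for _PrRenderCaps; defaults mirror the commit message."""

    roster_max_commits: int = 30
    discussion_max_comments: int = 20
    comment_max_chars: int = 500

    def __post_init__(self) -> None:
        # Same >=1 rule the commit message describes for the config values.
        for name, value in vars(self).items():
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")


def format_pr(commit_titles: list[str], comments: list[str], caps: PrRenderCaps) -> str:
    """Render a clipped "Squashed commits" roster and "Discussion" block."""
    lines = ["Squashed commits:"]
    lines += [f"- {title}" for title in commit_titles[: caps.roster_max_commits]]
    hidden = len(commit_titles) - caps.roster_max_commits
    if hidden > 0:
        # Assumption: the real formatter may signal truncation differently.
        lines.append(f"(+{hidden} more)")
    lines.append("Discussion:")
    for body in comments[: caps.discussion_max_comments]:
        lines.append(f"- {body[: caps.comment_max_chars]}")
    return "\n".join(lines)
```

Passing the caps in as an argument, rather than reading global config inside the formatter, matches the "threaded purely from RationaleConfig" design the commit message describes.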
Step 2 of the squash-merge recovery plan. Adds an int 0/1 on_default_branch
column to commit (default 1, server_default text("1")) so PR-origin commits
recovered from squash-merged PRs can be flagged 0 and kept out of the
main-walk-only queries (area-history, refactor-walk).
Migration uses a plain op.add_column (native ALTER TABLE ADD COLUMN) rather
than a batch recreate: recreating commit would trip the commit_file_change
foreign key on a populated DB. Additive + server-default backfilled, so
existing rows become on_default_branch=1 and re-scans stay safe.
No new table or pr_id column — the PR<->commit link stays the existing
commit_titles/_linked_prs path (plan 4.3).
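The backfill behaviour of an additive column can be illustrated with plain `sqlite3` instead of Alembic. This is a sketch under assumptions: the column names other than on_default_branch (`sha`, `commit_sha`, `path`) are invented for the example, and the raw `ALTER TABLE` stands in for what `op.add_column` would issue on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE "commit" (sha TEXT PRIMARY KEY);
    CREATE TABLE commit_file_change (
        id INTEGER PRIMARY KEY,
        commit_sha TEXT NOT NULL REFERENCES "commit"(sha),
        path TEXT NOT NULL
    );
    INSERT INTO "commit" (sha) VALUES ('aaa'), ('bbb');
    INSERT INTO commit_file_change (commit_sha, path) VALUES ('aaa', 'x.py');
    """
)

# Native ADD COLUMN: the table is not recreated, so the foreign key from
# commit_file_change is never disturbed, and existing rows read the default.
conn.execute('ALTER TABLE "commit" ADD COLUMN on_default_branch INTEGER DEFAULT 1')

existing = conn.execute(
    'SELECT sha, on_default_branch FROM "commit" ORDER BY sha'
).fetchall()

# A recovered PR-origin commit is stored with the flag set to 0 ...
conn.execute('INSERT INTO "commit" (sha, on_default_branch) VALUES (?, 0)', ("ccc",))

# ... and a main-walk-only query filters it back out.
main_walk = [
    row[0]
    for row in conn.execute(
        'SELECT sha FROM "commit" WHERE on_default_branch = 1 ORDER BY sha'
    )
]
child_rows = conn.execute("SELECT COUNT(*) FROM commit_file_change").fetchone()[0]
```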
Step 3 of the squash-merge recovery plan. When a feature PR is squash-merged, its original feature-branch commits are fetched once during the remote scan and persisted as on_default_branch=0 commit rows, linked to their PR through the existing commit_titles (no link table).

- scan/pr_origin_enricher.py: new PROriginEnricher crawler. Balanced gate (plan 3.3/3.5): a merged PR is enriched when its commit_titles oids are absent from commit (squash detection) AND (the merge commit is file-bulk OR it collapsed >= pr_origin_min_commits commits). One targeted batched git fetch carries only the gated candidates' refs/pull/N/head refspecs (never the refs/pull/* wildcard), pinned under refs/whygraph/pull/*. Best-effort: a failed fetch or unreadable oid is logged and skipped, never failing the scan (plan 6.6). Idempotent across re-scans.
- services/git: GitFetchRefsCmd + GitLogCommitCmd, with thin Repository.fetch_refs / Repository.commit_metadata (reuses Commit.from_git_log).
- cli scan: --pr-origins/--no-pr-origins flag (default on), wired as a phase-2 sibling of analyze; gated on a resolved GitHub client so it is skipped under --no-remote. Added a panel row.
- core/config: AnalyzeConfig.pr_origin_min_commits (default 5) + >=1 validation + example toml line.
- mcp 4.10 guards: _boring_shas_in and area_history_commits now filter on_default_branch == 1 so recovered origin commits never leak into the main-walk-only queries (defensive, since they carry no commit_file_change rows).

Tests: balanced-gate unit tests, end-to-end work() with a stubbed repo (asserts only candidate refspecs, on_default_branch=0 rows, dedup of an oid shared across PRs, graceful fetch-failure degrade), real-git fetch_refs/commit_metadata tests, config default/validation, and the 4.10 guard tests. Full suite: 489 passed.
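The balanced gate and the targeted refspec list might look roughly like this. It is a sketch, not the PROriginEnricher code: `MergedPr`, `should_enrich`, and `candidate_refspecs` are hypothetical names, "absent" is read as "none of the PR's oids are in commit", file-bulk detection is taken as a precomputed boolean, and the leading `+` and the `refs/whygraph/pull/N` destination shape are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergedPr:
    """Hypothetical view of a merged PullRequest row."""

    number: int
    commit_oids: tuple[str, ...]  # oids recorded in commit_titles
    merge_is_file_bulk: bool  # how "file-bulk" is decided is out of scope here


def should_enrich(pr: MergedPr, known_oids: set[str], min_commits: int = 5) -> bool:
    """Balanced gate: squash detected AND (file-bulk OR commit-rich)."""
    # Assumption: "absent" means no PR oid exists in the commit table.
    squashed = bool(pr.commit_oids) and not any(
        oid in known_oids for oid in pr.commit_oids
    )
    commit_rich = len(pr.commit_oids) >= min_commits
    return squashed and (pr.merge_is_file_bulk or commit_rich)


def candidate_refspecs(prs: list[MergedPr]) -> list[str]:
    """One refspec per gated PR, never the refs/pull/* wildcard."""
    return [f"+refs/pull/{pr.number}/head:refs/whygraph/pull/{pr.number}" for pr in prs]
```

The resulting list would be appended to a single `git fetch <remote>` invocation, so the cost scales with the number of gated candidates rather than with every PR in the repository.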
Step 5 (final) of the squash-merge recovery plan. When a queried line blames to a squash-merged PR's commit that Stage 1 enriched, collect_evidence now re-blames the same range at the PR's head_sha so each line maps back to the original feature-branch commit that authored it, surfaced as source="pr-origin".

- mcp/evidence.py: _attribute_squash_origins re-blames at head_sha, modelled on _predecessor_blame (same rev=-blame call, same best-effort per-PR GitError swallow). _enriched_squash_prs_for gates on the correct §4.8 predicate: blame SHA == a PR's merge_commit_sha AND that PR has >= 1 on_default_branch=0 origin row. The predicate is gate-agnostic, so it covers both file-bulk and commit-rich enriched squashes. Hunks feed the existing labeled_hunks list, so dedupe / priority / cap machinery is unchanged.
- _SOURCE_PRIORITY: pr-origin = 0.5 (just below blame=0); a real authoring commit reached through the squash beats every weaker label but loses to a direct HEAD blame hit.
- Labels (§4.9): _SOURCE_LABELS entry (rationale_generator); CommitEvidence.source docstring enum (analyze/rationale); evidence.py module + collect_evidence docstrings updated four→five signals.

Step 4 (lazy diffs for origin commits) needed no code: the enricher leaves llm_description NULL and origin commits are normal-sized, so backfill_evidence_descriptions already routes them through the normal / backfill_all branch, and Repository.diff resolves now that the object is local.

Tests: real squash repo (feature branch squashed into main, tip pinned under refs/whygraph/pull/1, branch deleted) asserting pr-origin entries match git blame head_sha; small/non-file-bulk squash variant (§4.8); graceful degrade on a GC'd head_sha; unenriched squash → no attribution; source-priority unit test. Full suite: 493 passed.
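The §4.8 predicate and the priority rule can be sketched as below. This is illustrative only: `SquashPr`, `enriched_squash_prs_for`, and `strongest_source` are hypothetical shapes, the `"weaker-signal"` label and its value 1 are placeholders for the other signals, and only blame=0 and pr-origin=0.5 come from the commit message. Lower numbers are treated as stronger.

```python
from dataclasses import dataclass

# blame and pr-origin values are from the commit message; "weaker-signal"
# is a placeholder for the remaining labels, whose values are not given.
SOURCE_PRIORITY = {"blame": 0, "pr-origin": 0.5, "weaker-signal": 1}


@dataclass(frozen=True)
class SquashPr:
    """Hypothetical view of a PullRequest row."""

    number: int
    merge_commit_sha: str
    head_sha: str
    commit_oids: tuple[str, ...]  # oids recorded in commit_titles


def enriched_squash_prs_for(
    blame_shas: set[str], prs: list[SquashPr], origin_oids: set[str]
) -> list[SquashPr]:
    """Blame SHA == merge_commit_sha AND the PR has >= 1 origin row."""
    # origin_oids: oids of commit rows stored with on_default_branch=0.
    return [
        pr
        for pr in prs
        if pr.merge_commit_sha in blame_shas
        and any(oid in origin_oids for oid in pr.commit_oids)
    ]


def strongest_source(labeled: list[tuple[str, str]]) -> dict[str, str]:
    """Keep, per commit SHA, the label with the lowest priority number."""
    best: dict[str, str] = {}
    for sha, source in labeled:
        if sha not in best or SOURCE_PRIORITY[source] < SOURCE_PRIORITY[best[sha]]:
            best[sha] = source
    return best
```

Because the predicate only asks whether origin rows exist, it does not need to know which branch of the scan-time gate admitted the PR.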
Collapse over-wrapped statements onto single lines so `ruff format --check` (run repo-wide in CI) passes. No behavior change.
Summary
Squash-merging a PR collapses its commits into a single commit on the default branch, erasing the per-commit history WhyGraph relies on for evidence and rationale. This PR recovers that lost context — surfacing the original PR commit titles and review comments, and re-attributing squashed work back to its true PR origin on a per-line basis.
Changes
- commit_titles and review comments now flow into evidence and rationale cards (mcp/evidence.py, analyze/rationale.py, analyze/rationale_generator.py).
- commit.on_default_branch discriminator (with Alembic migration) plus a scan-time pr_origin_enricher that recovers squash-merged PR commits (scan/pr_origin_enricher.py, db/models/commit.py, cli/commands/scan.py).
- (mcp/evidence.py, mcp/path_history.py).
- (services/git/commands.py, services/git/repository.py).
- core/config.py and whygraph.example.toml.